Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

How are you feeling right now?

Apprehensive about the uncertainties of the future.

What do you expect to learn?

Basic R tasks.

Where did you hear about the course?

Sisu.

It seems the book has a lot of content, probably quite useful.

I’m familiar with Markdown but not with R.

Repo link: https://github.com/gartenfeld/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec  4 21:56:20 2023"

The text continues here.


Week 2: Wrangle & Analyse

This week we follow through the examples in Exercise Set 2 and learn about how to do linear regression and beyond with R.

A few words about the data set:

date()
## [1] "Mon Dec  4 21:56:20 2023"

Step 1: Read the data

library(readr)
data <- read_csv('data/learning_2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(data)
## [1] 166   7

The data contains 166 survey responses, each with 7 columns (variables): gender, age of the participant, attitude score (the average of 10 questions about attitudes towards learning statistics), the deep/stra/surf metrics (averages of question groups covering deep, strategic, and surface learning approaches – see more details in this document), and points from an exam.

So the general gist is – you may have guessed it – what influences study outcomes in statistics? Is it attitude towards learning? Is it approach and life hacks in learning? Let’s find out.

Step 2: Graphical overview

“Show a graphical overview”, a “plot of everything”:

library(GGally)
library(ggplot2)
p <- ggpairs(data, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

The histograms (on the left-hand side) are quite useful for showing the distributions. We can see at a glance, for example, that a lot of younger students actually answered the survey, even though the first few rows happen to contain older respondents.

The box plots show how gender relates to the other variables.

The scatter plots make it harder to eyeball any trends.

But what’s most interesting for us is the strip of correlation printouts showing how points correlates with each of the other variables.

Step 3: Fit a model

Let’s pick the three variables with the largest absolute correlations with points: attitude, stra, and surf.
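As a quick sanity check, those correlations could be computed directly (a sketch; the exact values depend on the data):

```r
# Correlation of each numeric variable with exam points,
# sorted by absolute magnitude
num_vars <- data[, c("age", "attitude", "deep", "stra", "surf")]
sort(abs(cor(num_vars, data$points)[, 1]), decreasing = TRUE)
```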

p1 <- ggplot(data, aes(x = attitude, y = points))
p2 <- p1 + geom_point()
p2

Seems a bit all over the place. Let’s add a regression line:

p3 <- p2 + geom_smooth(method = "lm")
# Draw plot
p3
## `geom_smooth()` using formula = 'y ~ x'

Now, onto fitting a regression model with all three variables!

multi_model <- lm(points ~ attitude + stra + surf, data = data)
multi_model
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Coefficients:
## (Intercept)     attitude         stra         surf  
##     11.0171       3.3952       0.8531      -0.5861
summary(multi_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Step 4: Interpretation

Looking at the fitted coefficients (the estimates, not the R-squared values): attitude has a large positive effect, stra a much smaller positive one, and surf a small negative one. Note that only attitude is statistically significant (p < 0.001); the p-values for stra and surf are well above 0.05.
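Since stra and surf are not significant, a natural follow-up (not shown above) would be to drop them and refit the simpler model:

```r
# Simpler model keeping only the significant predictor
simple_model <- lm(points ~ attitude, data = data)
summary(simple_model)
```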

Step 5: Graphical validation

# Residuals vs Fitted values (1)
plot(multi_model, which = 1)

# Normal QQ-plot (2)
plot(multi_model, which = 2)

# Residuals vs Leverage (5)
plot(multi_model, which = 5)

According to the internet, the assumptions of a linear model include:

  • The errors belong to the outcome variable; the predictors (X) are treated as fixed
  • Errors are independent
  • Errors are normally distributed
  • Errors have constant variance (homoscedasticity)
  • The relationship between X and the mean of Y is linear

These basically say that the overall assumption of a linear model is a linear relationship between the inputs and the expected outcome 🤷‍♀️ That sounds tautological, but fitting a line only makes sense if a plausible underlying linear relationship exists… And when it does exist, the residuals tend to be evenly spread around the line.

The quantile-quantile plot falls pretty close to the reference line. The residuals seem fairly evenly spread across the fitted values, maybe slightly top-heavy. A handful of points have high leverage and therefore extra influence on the fit.


Week 3

This week we follow through the examples in Exercise Set 3.

date()
## [1] "Mon Dec  4 21:56:25 2023"

Step 1

Create this file and include it in index.Rmd.

Step 2

Read the joined student alcohol consumption data into R

library(readr)
data <- read_csv('data/drunk_kids.csv')

Print out the names of the variables in the data

names(data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Describe the data set briefly, assuming the reader has no previous knowledge of it

The data is about secondary-school students in Portugal and their grades. The joined data combines two subjects: maths and Portuguese. G1 and G2 are the grades from the first two periods and G3 is the final grade. The questionnaire also covers a range of socio-economic, behavioural, and other factors. In this exercise we ask whether heavy-drinking kids have worse study outcomes.
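The last two columns in the list above, alc_use and high_use, were derived during the data wrangling step. Assuming the standard definitions used in this course (an assumption; they already exist in the CSV), they could have been created roughly like this:

```r
library(dplyr)
# Hypothetical reconstruction of the derived columns from the wrangling step
data <- mutate(data,
               alc_use  = (Dalc + Walc) / 2,  # mean of weekday and weekend use
               high_use = alc_use > 2)        # TRUE for high consumers
```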

The purpose of your analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. To do this, choose 4 interesting variables in the data and for each of them, present your personal hypothesis about their relationships with alcohol consumption.

Choices of variables and reasoning:

  • sex: interesting to see if there’s a gender difference in consumption
  • romantic: cool kids drink more often and socialise more
  • famrel: kids with bad family relationship drink more
  • goout: kids go out with friends and drink

Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.

library(dplyr)
data %>% count(sex)
## # A tibble: 2 × 2
##   sex       n
##   <chr> <int>
## 1 F       195
## 2 M       175
library(ggplot2)

data_for_plot <- tidyr::gather(data, key = "timing", value = "combined_use", Dalc, Walc)


ggplot(data_for_plot, aes(x = sex, y = combined_use, fill = timing)) +
  geom_boxplot(position = "dodge", alpha = 0.2) +

  labs(
       x = "Sex",
       y = "Alcohol Consumption (5 = Very high)",
       fill = "Occasion"
  ) +

  scale_fill_manual(
    values = c("Dalc" = "lightblue", "Walc" = "lightgreen"),
    labels = c("Dalc" = "Weekday", "Walc" = "Weekend")
  )


Comments:

  • Boys drink more
  • Girls rarely drink during the week
  • There may be a significant difference between the sexes

library(ggplot2)

ggplot(data, aes(x = romantic, y = alc_use, fill = romantic)) +
  geom_boxplot(position = position_dodge(0.8), alpha = 0.8) +
  labs(title = "Alcohol Consumption by Sex and Romantic Relationship",
       x = "Romantic Relationship",
       y = "Alcohol Consumption",
       fill = "Romantic") +
  scale_fill_manual(values = c("yes" = "pink", "no" = "lightgrey"),
                    name = "In Relationship",
                    labels = c("yes" = "Yes", "no" = "No")) +
  facet_grid(. ~ sex)


Comments:

  • Relationship status makes essentially no difference to girls’ drinking
  • It makes almost no difference for boys either
  • Therefore one cannot drink one’s way to pair bonding

library(ggplot2)

ggplot(data, aes(x = famrel, y = alc_use, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Alcohol Consumption and Family Relationship",
       x = "Family Relationship",
       y = "Alcohol Consumption",
       color = "Sex") +
  scale_color_manual(values = c("M" = "deepskyblue", "F" = "pink"),
                     name = "Sex")


Comments:

  • For boys the trend is more noticeable: drinking decreases as family relationships improve
  • For girls the line is pretty flat: drinking stays relatively low across all family environments

library(ggplot2)

ggplot(data, aes(x = goout, y = alc_use, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Alcohol Consumption and Going Out",
       x = "Going Out with Friends",
       y = "Alcohol Consumption",
       color = "Sex") +
  scale_color_manual(values = c("M" = "deepskyblue", "F" = "pink"),
                     name = "Sex")


Comments:

  • Quite obvious trends of going out involving drinking

Step 5

Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")

Present and interpret a summary of the fitted model.

summary(alco_predictor)
## 
## Call:
## glm(formula = high_use ~ famrel + goout, family = "binomial", 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9404     0.6179  -3.140  0.00169 ** 
## famrel       -0.3891     0.1350  -2.882  0.00395 ** 
## goout         0.7923     0.1193   6.640 3.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 394.65  on 367  degrees of freedom
## AIC: 400.65
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

  • Family relationship goodness has a significant negative influence on high use
  • Going out with friends has a significant positive influence on high use

Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.

coef(alco_predictor)
## (Intercept)      famrel       goout 
##  -1.9403921  -0.3890518   0.7923023

Interpretation:

  • An increase in goout is associated with an increase in the log-odds of high_use, while an increase in famrel is associated with a decrease in the log-odds of high_use.
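The odds ratios and their confidence intervals were not printed above; a sketch of how they could be obtained by exponentiating the log-odds coefficients:

```r
# Odds ratios = exp(log-odds coefficients)
OR <- exp(coef(alco_predictor))
# Profile-likelihood confidence intervals, also on the odds-ratio scale
CI <- exp(confint(alco_predictor))
cbind(OR, CI)
```

An odds ratio above 1 (here, goout) multiplies the odds of high use with each unit increase, while one below 1 (famrel) shrinks them.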

Step 6

Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions.

alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")
probs <- predict(alco_predictor, type = "response")

library(dplyr)
forwards_data <- mutate(data, probability = probs)
forwards_data <- mutate(forwards_data, prediction = probability > 0.5)

table(high_use = forwards_data$high_use, prediction = forwards_data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   233   26
##    TRUE     64   47

Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results.

TP <- 47
TN <- 233
FP <- 26
FN <- 64
training_error <- (FP + FN) / (TP + TN + FP + FN)
training_error
## [1] 0.2432432

Compare the performance of the model with performance achieved by some simple guessing strategy.

So it misclassifies about a quarter of the cases. As for a “simple guessing strategy”: always predicting the majority class (not a high user) would already be right about 70% of the time (259 of 370 cases), so the model beats both that baseline and a coin toss, though not by a huge margin. Still, 75+% correct classification seems a rather good first-try result in machine-learning land.
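For reference, the accuracy of the majority-class guessing strategy can be read straight off the cross tabulation above:

```r
# Always guessing FALSE (not a high user), using the row totals above
actual_false <- 233 + 26
actual_true  <- 64 + 47
majority_accuracy <- actual_false / (actual_false + actual_true)
majority_accuracy  # 259/370 = 0.7
```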

Step 7

Perform 10-fold cross-validation on your model.

data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ famrel + goout, data = data, family = "binomial")

# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Perform cross-validation using caret
cv_results <- train(high_use ~ famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)

# View cross-validation results
cv_results
## Generalized Linear Model 
## 
## 370 samples
##   2 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 332, 333, 333, 333, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7299036  0.2589837

The prediction is about 73% accurate and quite a bit better than chance.

The performance of this model is on par with the “model introduced in the Exercise Set (which had about 0.26 error)”.

Step 8

Perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a very high number of predictors and explore the changes in the training and testing errors as you move to models with less predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

Alright then, let’s start with a “very high number of predictors”.

data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + freetime + health + failures + paid + absences + G1 + G2 + G3 + famrel + goout, data = data, family = "binomial")

# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Perform cross-validation using caret
cv_results <- train(high_use ~ address + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian + traveltime + studytime + schoolsup + famsup + activities + nursery + higher + internet + romantic + freetime + health + failures + paid + absences + G1 + G2 + G3 + famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)

# View cross-validation results
cv_results
## Generalized Linear Model 
## 
## 370 samples
##  28 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 332, 334, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7432077  0.3392237

Not much better…

Now, with hand-picked variables:

data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ address + Pstatus + higher + absences + famrel + goout, data = data, family = "binomial")

# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Perform cross-validation using caret
cv_results <- train(high_use ~ address + Pstatus + higher + absences + famrel + goout, data = data, method = "glm", family = "binomial", trControl = ctrl)

# View cross-validation results
cv_results
## Generalized Linear Model 
## 
## 370 samples
##   6 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 333, 333, 333, 333, 334, 333, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7402047  0.3100473

After manually trying a bunch of combinations, it doesn’t seem to get much better than this. Because cross-validation assigns observations to folds at random, the accuracy also jumps around between runs, making it hard to eyeball whether one model is really better or worse than another.
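One way to make such comparisons repeatable (a sketch, using caret as above) is to fix the random seed right before each train() call, so the fold assignment is the same every time:

```r
set.seed(42)  # any fixed value; makes the fold assignment reproducible
cv_results <- train(high_use ~ address + Pstatus + higher + absences + famrel + goout,
                    data = data, method = "glm", family = "binomial", trControl = ctrl)
```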

But wait, what about…

data$high_use <- as.factor(data$high_use)
alco_predictor <- glm(high_use ~ Walc + Dalc, data = data, family = "binomial")

# Run `install.packages('caret', dependencies = TRUE)` beforehand
library(caret)
# Create a training control object for 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Perform cross-validation using caret
cv_results <- train(high_use ~ Walc + Dalc, data = data, method = "glm", family = "binomial", trControl = ctrl)

# View cross-validation results
cv_results
## Generalized Linear Model 
## 
## 370 samples
##   2 predictor
##   2 classes: 'FALSE', 'TRUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 332, 333, 333, 333, 333, 334, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   1         1

Well, statistical truths are often tautological: high_use was derived directly from Walc and Dalc during the data wrangling, so the model is simply recovering its own definition – hence the perfect accuracy.


Week 4

This week we follow through the examples in Exercise Set 4.

date()
## [1] "Mon Dec  4 21:56:29 2023"

Step 1

Create this file and include it in index.Rmd.

Step 2

“Load the Boston data from the MASS package”

# Load MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

“Explore the structure and the dimensions of the data”

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

“Describe the dataset briefly, assuming the reader has no previous knowledge of it.”

The dataset comes from a study by Harrison and Rubinfeld (1978) on house prices in Boston suburbs. It’s commonly used for teaching stats (specifically regression analysis, often called “machine learning” these days) by examples of predicting real estate prices from a number of demographic and socio-economic factors. The dataset contains 506 observations (rows) and 13 predictor variables plus the dependent variable (medv, median value of owner-occupied homes). See documentation.

Step 3

“Show a graphical overview of the data”

library(GGally)
library(ggplot2)
library(dplyr)
p <- ggpairs(Boston)
p

cor_matrix <- cor(Boston) %>% round(digits = 2)
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle", diag = FALSE, type = 'lower')

“Show summaries of the variables in the data”

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

“Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.

Looking at the bottom row medv as an example, we can see that rm (“average number of rooms per dwelling”) is strongly positively correlated with medv and that lstat (“lower status of the population in %”) is strongly negatively correlated. This is not surprising simply from reading the variable descriptions, that is, they are re-stating intuitive claims such as “larger houses cost more”, “poor people live in cheap neighbourhoods”.

Step 4

“Standardize the dataset and print out summaries of the scaled data.”

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

“How did the variables change?”

Each value now expresses how many standard deviations the observation lies from the variable’s mean: most values fall within roughly ±3, and anything beyond that is an extreme.
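Concretely, scale() subtracts each column’s mean and divides by its standard deviation; the same result could be verified by hand for one variable:

```r
# Manual standardization of one column, for comparison with scale()
z_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(z_crim, as.numeric(boston_scaled[, "crim"]))  # should be TRUE
```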

“Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate).”

boston_scaled <- as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)

“Use the quantiles as the break points in the categorical variable.”

quantile_labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = quantile_labels, include.lowest = TRUE)
boston_scaled <- data.frame(boston_scaled, crime)

“Drop the old crime rate variable from the dataset.”

boston_scaled <- dplyr::select(boston_scaled, -crim)

“Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.”

num_observations <- nrow(boston_scaled)
sample_ids <- sample(num_observations,  size = num_observations * 0.8)
train_set <- boston_scaled[sample_ids,]
test_set <- boston_scaled[-sample_ids,]

Step 5

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

“Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.”

# `crime` as the target variable, and "." (all the other variables) as predictors
lda.fit <- lda(crime ~ ., data = train_set)
classes <- as.numeric(train_set$crime)

“Draw the LDA biplot.”

plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Step 6

“Save the crime categories from the test set”

correct_crime_level_predictions <- test_set$crime

“Remove the categorical crime variable from the test dataset.”

test_set <- dplyr::select(test_set, -crime)

“Then predict the classes with the LDA model on the test data.”

lda.pred <- predict(lda.fit, newdata = test_set)

“Cross tabulate the results with the crime categories from the test set.”

table(correct = correct_crime_level_predictions, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      11        3    0
##   med_low    1      19        4    0
##   med_high   0       8       17    3
##   high       0       0        1   24

“Comment on the results.”

Predictions of “low” are very accurate: 11 of the 12 areas predicted low really are low, and the remaining one is medium-low. Similarly for “high”: 24 of the 27 are spot on, with the other 3 in the adjacent category. The two middle categories get mixed up either way more often, being correct roughly half to two-thirds of the time.

Step 7

“Reload the Boston dataset.”

library(MASS)
data("Boston")

“Scale the variables to get comparable distances.”

boston_scaled <- scale(Boston)

“Calculate the distances between the observations.”

dist_eu <- dist(boston_scaled)

“Run k-means algorithm on the dataset.”

km <- kmeans(boston_scaled, centers = 4)

“Investigate what is the optimal number of clusters and run the algorithm again.”

set.seed(42)

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Spot the cliff
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

“Visualize the clusters”

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

“Interpret the results”

Clustering observations into groups is predicated on the notion that there are natural or hidden categories in the observations; here we could ask “how many types of suburbs are there?” With two well-separated classes the grouping is easy to spot visually: on a scatter plot one cloud of points sits in one corner and the other way over in the opposite corner – in the terms used above, there is a great distance between the two clusters. Here we have a dozen or so variables rather than two, but that’s no problem for the machine: the same principle applies, and we look for groups whose centres lie far apart with all dimensions considered. The pairs chart is a bit hard to read, but upon zooming in you can see that for some variable pairs the groups separate quite distinctly.
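In that spirit, the separation could also be quantified directly by looking at the distance between the fitted cluster centres:

```r
# Euclidean distance between the two cluster centres, in standardized units
dist(km$centers)
```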

Extra

“Perform k-means on the original Boston data with some reasonable number of clusters (> 2).”

km <- kmeans(boston_scaled, centers = 3)

“Then perform LDA using the clusters as target classes.”

cluster <- km$cluster
boston_scaled <- data.frame(boston_scaled, cluster)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv            cluster     
##  Min.   :-1.5296   Min.   :-1.9063   Min.   :1.000  
##  1st Qu.:-0.7986   1st Qu.:-0.5989   1st Qu.:1.000  
##  Median :-0.1811   Median :-0.1449   Median :2.000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   :1.941  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683   3rd Qu.:3.000  
##  Max.   : 3.5453   Max.   : 2.9865   Max.   :3.000

“Visualize the results with a bi-plot.”

lda.fit <- lda(cluster ~ ., data = boston_scaled[, -4])
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

“Interpret the results. Which variables are the most influential linear separators for the clusters?”

I would say indus (proportion of non-retail business acres per town) is useful for separating Group 3 out, and tax (full-value property-tax rate per $10,000) is useful for distinguishing Groups 1 and 2.
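Reading influence off the arrows can also be done numerically: the discriminant coefficients (`$scaling`) give each variable’s weight in the linear separators, with larger magnitudes meaning more influence. An illustrative sketch on the built-in iris data rather than the Boston clusters:

```r
library(MASS)
# LDA with three classes yields two linear discriminants, LD1 and LD2
fit <- lda(Species ~ ., data = iris)
fit$scaling    # coefficients: larger |value| = more influential separator
```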


Thank you and see you next week


Week 5

This week we follow through the examples in Exercise Set 5.

date()
## [1] "Mon Dec  4 21:56:43 2023"

Analysis

Step 0

Create this file and include it in index.Rmd.

Step 1

library(readr)
human <- read_csv("data/human.csv", show_col_types = FALSE)

“Move the country names to rownames”

library(tibble)
human <- column_to_rownames(human, "Country")

“Show a graphical overview of the data”

library(GGally)
library(ggplot2)
ggpairs(human, progress = FALSE)

“Show summaries of the variables in the data”

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
library(corrplot)
cor(human)
##                Edu2.FM      Labo.FM   Life.Exp     Edu.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.5760299  0.59325156  0.43030485
## Labo.FM    0.009564039  1.000000000 -0.1400125  0.04732183 -0.02173971
## Life.Exp   0.576029853 -0.140012504  1.0000000  0.78943917  0.62666411
## Edu.Exp    0.593251562  0.047321827  0.7894392  1.00000000  0.62433940
## GNI        0.430304846 -0.021739705  0.6266641  0.62433940  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.8571684 -0.73570257 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.7291774 -0.70356489 -0.55656208
## Parli.F    0.078635285  0.250232608  0.1700863  0.20608156  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
library(corrplot)
correlation_matrix <- cor(human) %>% round(digits = 2)
corrplot(correlation_matrix, method="circle", diag = FALSE, type = 'lower')

ggplot(human, aes(x = GNI)) +
  geom_histogram(binwidth = 1000, fill = "blue", color = "black", alpha = 0.7)

“Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them”

What can we say about the distributions of the variables? It is somewhat reminiscent of Hans Rosling’s lectures… And because of the hand-picked nature of our basket of variables, it is not surprising that the scatter plots show some noticeable patterns. For GNI, for example, we can see there are a few very rich countries, a bunch of middle-income countries, and then many poor countries.

Going from the Rosling perspective (public health and human development), we can see for example that Life Expectancy and Maternal Mortality are strongly correlated, with both being influenced by health infrastructure and the like. Life Expectancy also correlates with the availability of education, which goes hand in hand with economic development. We can also see that more women being educated correlates with lower Maternal Mortality as well as fewer teenage births.

Another interesting pairing is GNI / Parli.F, which shows that although some countries are quite rich, they did not necessarily become more equal: the scatter shows a fairly even spread of female representation across income levels (sideways), and in fact there are a few outliers of very rich countries with almost no female parliamentary representation.

We can see from the correlation matrix that the variables we have selected have a general tendency to overlap with each other strongly, either positively or negatively – not necessarily because they are synonymous or antonymous with each other, but because they reflect different facets of some shared underlying influence or condition. The weak circles (low correlations) are also interesting: for example, improved health development doesn’t translate into women working more.
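Rather than eyeballing the circles, the strongest pairs can be pulled out programmatically. The small helper below is my own sketch (not from the course materials), shown on the built-in mtcars data so it runs anywhere; it would work the same on `human`.

```r
# List the n strongest pairwise correlations in a numeric data frame
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA        # keep each pair only once
  ord <- order(abs(cm), decreasing = TRUE, na.last = NA)
  idx <- arrayInd(ord[seq_len(n)], dim(cm))   # recover row/column positions
  data.frame(var1 = rownames(cm)[idx[, 1]],
             var2 = colnames(cm)[idx[, 2]],
             r    = cm[ord[seq_len(n)]])
}

top_correlations(mtcars, n = 3)
```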

Step 2

“Perform principal component analysis (PCA) on the raw (non-standardized) human data”

pca_human <- prcomp(human)

“Show the variability captured by the principal components”

summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000

“Draw a biplot displaying the observations by the first two principal components along with arrows representing the original variables”

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Step 3

“Standardize the variables in the human data and repeat the above analysis”

human_standardised <- scale(human)
pca_human_standardised <- prcomp(human_standardised)
summary(pca_human_standardised)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
biplot(pca_human_standardised, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

“Interpret the results of both analysis (with and without standardizing)”

“Are the results different? Why or why not?”

The results are very different before and after scaling. The unscaled plots are virtually unreadable. This is more or less expected because of the vastly differing units: GNI is measured in dollars and runs into six figures, whereas a ratio stays below one. In the unscaled biplot GNI is squashed into a single horizontal arrow, so PC1 essentially just reproduces GNI.
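The same effect can be demonstrated on any data set with mixed units. A sketch on the built-in mtcars data (where disp, in cubic inches, has by far the largest variance): without scaling, PC1 swallows almost all the variance; after scaling, it is spread across components.

```r
pca_raw    <- prcomp(mtcars)                 # raw units: disp dominates
pca_scaled <- prcomp(mtcars, scale. = TRUE)  # standardised variables

# Row 2 of the importance matrix is "Proportion of Variance"
summary(pca_raw)$importance[2, 1]      # PC1 share, unscaled: close to 1
summary(pca_scaled)$importance[2, 1]   # PC1 share, scaled: much smaller
```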

Interpreting the scaled biplot is a lot more sensible. We can see that Life Expectancy, the years of education a child can expect, and the female-to-male ratio of achieving secondary education are all correlated with each other and with money, and they are negatively correlated with Maternal Mortality and teenage births. All of these variables have arrows that are close to parallel with PC1.

Whereas women in parliament and women in the workforce are correlated with each other, and their arrows are close to parallel with PC2.

See more comments below.

“Include captions (brief descriptions) in your plots”

biplot(pca_human_standardised, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
legend("bottomright", legend = c(
"Life.Exp: Life Expectancy",
"Edu.Exp: Expected Years of Education",
"GNI: Gross National Income",
"Mat.Mor: Maternal Mortality Ratio",
"Ado.Birth: Adolescent Birth Rate",
"Parli.F: Percentage of Women in Parliament",
"Edu2.FM: Female-to-Male Ratio of Secondary Education",
"Labo.FM: Female-to-Male Ratio of Labor Force Participation"
), cex = 0.6)  # cex = 0.2 would render the legend almost unreadably small

Step 4

“Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data”

The principal components can sometimes be hidden factors that do not map directly to a known concept, or that map to a very abstract quality that is hard to pinpoint, so we cannot over-generalise that a component is “richness” or “goodness” or “development”. One can only make rough guesses about what kinds of qualities the axes may correspond to, based on heuristics (that is, generalisation from preconceived perceptions).

In the biplot, the points (labelled with country names) are observations mapped along the two main PCs. Points close to each other share similar values as measured by the PC axes. Here, using our preconceived priors as heuristics, we can gauge: ah okay, Qatar is on one side and Afghanistan on the other, so this axis has something to do with rich vs. poor. Then you see the Nordics up there and Iran down there, and you wonder if that axis has something to do with permissiveness towards certain things, which could be bacon for all we know. Here I must apologise for my geographical ignorance: I really don’t know what it is subjectively like in those poorer countries that happen to be quite high up in terms of PC2. Kigali has a vibrant start-up scene with hipster coffee shops nowadays, as rumours say, which could relate to the UK government’s interest in sending people there.

If two arrows are radially close to each other (that is, having similar clock-angles on the chart), the two original variables co-vary more closely: if they point the same way they co-vary positively, and if they point in almost opposite directions (about 180 degrees apart) they co-vary negatively.

We can think of the PC axes like arrows too: if an arrow is close to parallel with an axis, its corresponding original variable is more aligned with that PC. The length of the arrow indicates how strongly that original variable contributes to the components.
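The arrows in the biplot are drawn from the columns of the rotation (loadings) matrix, so inspecting it numerically complements the visual reading of angles and lengths. A sketch on the built-in mtcars data, scaled as above:

```r
pca <- prcomp(mtcars, scale. = TRUE)
# Each row is an original variable; columns are its loadings on PC1 and PC2.
# Large |loading| on a PC means the variable's arrow is long and aligned with it.
round(pca$rotation[, 1:2], 2)
```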

Step 5

“Load the tea dataset and convert its character variables to factors”

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

“Look at the structure and the dimensions of the data”

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

“Use View(tea) to browse its contents, and visualize the data”

# View(tea)
# Run this for one's own viewing leisure

“Use Multiple Correspondence Analysis (MCA) on the tea data (or a subset of columns)”

library(FactoMineR)
subset_tea <- tea[, c("Tea", "How", "how", "sugar")]
mca <- MCA(subset_tea, graph = FALSE)

“Interpret the results of the MCA”

summary(mca)
## 
## Call:
## MCA(X = subset_tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.338   0.292   0.272   0.261   0.242   0.227   0.208
## % of var.             16.924  14.598  13.609  13.031  12.088  11.337  10.389
## Cumulative % of var.  16.924  31.522  45.131  58.162  70.250  81.587  91.977
##                        Dim.8
## Variance               0.160
## % of var.              8.023
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.102  0.010  0.008 |  0.251  0.072  0.046 | -0.070
## 2                  |  0.521  0.267  0.128 |  0.730  0.609  0.251 | -0.622
## 3                  | -0.045  0.002  0.003 | -0.227  0.059  0.074 | -0.322
## 4                  | -0.625  0.385  0.534 | -0.109  0.014  0.016 | -0.130
## 5                  | -0.045  0.002  0.003 | -0.227  0.059  0.074 | -0.322
## 6                  | -0.045  0.002  0.003 | -0.227  0.059  0.074 | -0.322
## 7                  | -0.045  0.002  0.003 | -0.227  0.059  0.074 | -0.322
## 8                  |  0.521  0.267  0.128 |  0.730  0.609  0.251 | -0.622
## 9                  |  0.057  0.003  0.002 |  0.343  0.134  0.063 | -0.320
## 10                 |  0.946  0.881  0.533 |  0.104  0.012  0.006 |  0.100
##                       ctr   cos2  
## 1                   0.006  0.004 |
## 2                   0.474  0.182 |
## 3                   0.127  0.149 |
## 4                   0.021  0.023 |
## 5                   0.127  0.149 |
## 6                   0.127  0.149 |
## 7                   0.127  0.149 |
## 8                   0.474  0.182 |
## 9                   0.125  0.055 |
## 10                  0.012  0.006 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   1.161  24.547   0.441  11.485 |   0.745  11.739   0.182
## Earl Grey          |  -0.532  13.455   0.511 -12.358 |  -0.031   0.054   0.002
## green              |   0.509   2.107   0.032   3.096 |  -1.489  20.894   0.274
## alone              |   0.124   0.736   0.028   2.918 |  -0.476  12.611   0.421
## lemon              |  -0.691   3.880   0.059  -4.201 |   0.295   0.822   0.011
## milk               |  -0.252   0.987   0.017  -2.249 |   0.817  12.002   0.177
## other              |   1.617   5.795   0.081   4.918 |   3.511  31.670   0.381
## tea bag            |  -0.349   5.092   0.159  -6.897 |   0.140   0.949   0.026
## tea bag+unpackaged |   0.264   1.609   0.032   3.080 |   0.078   0.165   0.003
## unpackaged         |   0.958   8.142   0.125   6.120 |  -0.865   7.695   0.102
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                7.376 |   0.182   0.748   0.011   1.798 |
## Earl Grey           -0.724 |   0.057   0.190   0.006   1.316 |
## green               -9.054 |  -0.739   5.516   0.067  -4.492 |
## alone              -11.217 |  -0.096   0.545   0.017  -2.251 |
## lemon                1.796 |   2.116  45.221   0.553  12.861 |
## milk                 7.283 |  -0.846  13.801   0.190  -7.541 |
## other               10.677 |   0.234   0.150   0.002   0.711 |
## tea bag              2.766 |  -0.439  10.042   0.252  -8.685 |
## tea bag+unpackaged   0.917 |   0.316   2.869   0.045   3.688 |
## unpackaged          -5.526 |   1.250  17.214   0.213   7.980 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.543 0.382 0.070 |
## How                | 0.154 0.667 0.650 |
## how                | 0.201 0.103 0.328 |
## sugar              | 0.456 0.016 0.040 |

“Draw the variable biplot of the analysis”

plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

“Comment on the output of the plots”

I can comment that it is indeed news to me that Earl Grey can also go with lemon. It’s quite expected that green tea does not go with milk the way Earl Grey or black tea does, but I’m curious to find out what that “other” category might entail. Honey, perhaps? It’s also curious what “tea bag + unpackaged” means; perhaps it’s sometimes this and sometimes that. You can also see that for green tea drinkers it’s more of a snobbery thing, where drinking loose leaves is perceived as somehow superior to just dunking a pre-manufactured bag in hot water.


(more chapters to be added similarly as we proceed with the course!)